About the project

First day

Starting a bit late, but very hopeful! I am really interested in the topic, but technical difficulties are driving me insane. Also very lost for now.


Regression and model validation

Data exploration

Read the data

learning2014 <- read.csv("./data/learning2014.csv")

For data structure:

dim(learning2014)
## [1] 166   8
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   X gender age attitude     deep  stra     surf points
## 1 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6 6      F  38      3.8 4.750000 3.625 2.416667     21

Explore the dataset

summary(learning2014)
##        X          gender       age           attitude          deep      
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Median : 83.50           Median :22.00   Median :3.200   Median :3.667  
##  Mean   : 83.50           Mean   :25.51   Mean   :3.143   Mean   :3.680  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##  Max.   :166.00           Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Create a subset

library(dplyr)
keep <- c("Country", "Edu2.FM", "Labo.FM", "Life.Exp", "Edu.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human <- select(human, one_of(keep))

Missing values

In R, NA stands for not available, which means that the data point is missing. If a variable you wish to analyse contains missing values, there are usually two main options:
1. Remove the observations with missing values
2. Replace the missing values with actual values using an imputation technique.

Option 1:

# print out a completeness indicator of the 'human' data
complete.cases(human)

# print out the data along with a completeness indicator as the last column
data.frame(human[-1], comp = complete.cases(human))

# filter out all rows with NA values
human_ <- filter(human, complete.cases(human))
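Option 2:

A minimal sketch of one common imputation technique, mean imputation, shown on a tiny toy data frame (for anything serious, a dedicated package such as mice is preferable):

```r
library(dplyr)

# toy data frame with missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))

# replace each NA with the mean of its column
df_imp <- df %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
df_imp
```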

Excluding observations

# look at the last 10 observations of human
tail(human, 10)

# define the last index we want to keep
last <- nrow(human) - 7

# keep everything except the last 7 observations
human_ <- human[1:last, ]

# add countries as rownames
rownames(human_) <- human_$Country

Graphical exploration

library(GGally)
## Loading required package: ggplot2
library(ggplot2)
pairs(learning2014[-1], col = learning2014$gender)

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

Probability plots

Before starting any analysis, it is good to look at the probability plots of each variable.

Options:
1. The qqnorm() function creates a Quantile-Quantile plot evaluating the fit of sample data to the normal distribution.
2. The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate distributions. The format is fitdistr(x, densfun), where x is the sample data and densfun is one of: “beta”, “cauchy”, “chi-squared”, “exponential”, “f”, “gamma”, “geometric”, “log-normal”, “lognormal”, “logistic”, “negative binomial”, “normal”, “Poisson”, “t” or “weibull”.
3. For other distributions, see: https://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf

If the distribution does not satisfy the normality criterion, log-transform the variable to identify outliers more clearly.
Delete those observations.
Then check the probability plot again.
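A minimal sketch of the two options above, on simulated (deliberately non-normal) data:

```r
library(MASS)

set.seed(42)
x <- rexp(200, rate = 0.5)        # a clearly non-normal sample

# option 1: QQ plot against the normal distribution
qqnorm(x)
qqline(x)                         # points deviate from the line in the tails

# option 2: maximum-likelihood fit of a candidate density
fit <- fitdistr(x, "exponential")
fit$estimate                      # estimated rate, close to the true 0.5
```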

Boxplots for numerical variables

library(ggplot2)
  1. Create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes:

    absences_norm <- rnorm(200, mean = mean(absences, na.rm = TRUE), sd = sd(absences, na.rm = TRUE))
  2. Plot the two variables side by side

    boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)

Correlation plots

library(MASS)       # for the Boston dataset
library(tidyverse)
library(corrplot)

Option 1

  1. Calculate the correlation matrix and round it (tidyverse and corrplot packages)

    cor_matrix <- cor(Boston)
    # round to 2 decimal places
    cor_matrix_r <- cor_matrix %>% round(2)
  2. Visualize the correlation matrix

    corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

    Option 2

  3. Use pairs() command (or ggpairs with GGally package)
  4. With pairs() you can reduce the number of pairs to see the plots more clearly. After the data argument, specify the columns you want to see, e.g. pairs(Boston[6:10], col = km$cluster) (here km is assumed to be a fitted k-means object)

Center and standardize numerical variables

scale()
boston_scaled <- scale(Boston) # mean = 0, centers around zero
class(boston_scaled) # scaling returns a matrix
boston_scaled <- as.data.frame(boston_scaled) # convert the matrix to a data frame

Create a categorical variable from a numerical one

Quantiles
  1. Create a quantile vector of crim and print it

    bins <- quantile(boston_scaled$crim)
    bins
  2. Create a categorical variable ‘crime’

    crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
  3. Look at the table of the new factor crime

    table(crime)
  4. Remove the original crim from the dataset

    boston_scaled <- dplyr::select(boston_scaled, -crim)
  5. Add the new categorical variable to the scaled data

    boston_scaled <- data.frame(boston_scaled, crime)

Linear regression

We are now interested in exploring whether we can explain the student exam grade by relating it with other available variables. A multiple regression model is fitted with a response variable ‘points’ and explanatory variables ‘attitude’, ‘stra’ and ‘surf’.

lm <- lm(points ~ attitude + stra + surf, data = learning2014)
lm
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         stra         surf  
##     11.0171       3.3952       0.8531      -0.5861
summary(lm)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model with three explanatory variables accounts for 21% of the variance (multiple R-squared). It shows that the only variable with a statistically significant relationship with the exam grade is ‘attitude’. Next we drop the two other factors one at a time in search of a more parsimonious model.

lm1 <- lm(points ~ attitude + stra, data = learning2014)
lm1
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         stra  
##      8.9729       3.4658       0.9137
summary(lm1)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
lm2 <- lm(points ~ attitude + surf, data = learning2014)
lm2
## 
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         surf  
##      14.120        3.426       -0.779
summary(lm2)
## 
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.277  -3.236   0.386   3.977  10.642 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.1196     3.1271   4.515 1.21e-05 ***
## attitude      3.4264     0.5764   5.944 1.63e-08 ***
## surf         -0.7790     0.7956  -0.979    0.329    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1854 
## F-statistic: 19.78 on 2 and 163 DF,  p-value: 2.041e-08

Dropping the ‘surf’ or ‘stra’ variable barely changed the multiple R-squared and thus the fit of the model. It is therefore safe to conclude that they do not affect the exam grade, and we can proceed with the only significant factor, ‘attitude’.

lm3 <- lm(points ~ attitude, data = learning2014)
lm3
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude  
##      11.637        3.525
summary(lm3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The fitted model explains 19% of the variance, and ‘attitude’ is concluded to be a statistically significant predictor of the exam grade.

Residual analysis

Next, to make sure that the model is fitted correctly, a residual analysis is in order.

Residuals vs. Fitted values

plot(lm3, which = c(1))

QQ plot (theoretical quantiles)

plot(lm3, which = c(2))

Residuals vs. Leverage

plot(lm3, which = c(5))

The residuals vs. fitted values are evenly spread and do not demonstrate any pattern. The QQ plot shows a good fit along the line with no cause for concern. The leverage plot does not show any particularly influential outliers.
The model is valid.

Prediction. Train and test sets

  1. Determine the number of rows in the dataset

    n <- nrow(boston_scaled)
  2. Choose randomly 80% of the rows

    ind <- sample(n,  size = n * 0.8)
  3. Create train set

    train <- boston_scaled[ind,]
  4. Create test set by subtracting the train set

    test <- boston_scaled[-ind,]
  5. Save the correct classes from test data

    correct_classes <- test$crime  # the variable to be predicted
  6. Remove this variable from test data

    test <- dplyr::select(test, -crime)


Chapter 3. Logistic regression. Analysis

This dataset describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features; the data were collected using school reports and questionnaires.

alc <- read.csv("./data/alc.csv")

The dataset contains 35 variables and 382 observations pertaining to the performance, family, personal life and other information about students and their consumption of alcohol.

Here, I am interested in studying the relationships between high/low alcohol consumption and other variables in the data, namely, four of them: ‘absences’ (number of school absences), ‘traveltime’ (home to school travel time), ‘goout’ (going out with friends), and ‘famrel’ (quality of family relationships).

I hypothesize that a high number of absences and a lot of time spent with friends, together with bad family relationships and short travel time, will increase the odds of high alcohol consumption in students.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
keep_columns <- c("sex", "age", "absences", "traveltime", "goout", "famrel", "high_use")
alc <- select(alc, one_of(keep_columns))
sex <- alc$sex
age <- alc$age
absences <- alc$absences
traveltime <- alc$traveltime
goout <- alc$goout
famrel <- alc$famrel
high_use <- alc$high_use

We have now reformatted the dataset to only keep the columns pertaining to the hypothesis and two general variables, sex and age. Now onto the graphical exploration.

library(ggplot2)
# create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes
absences_norm <- rnorm(200, mean = mean(absences, na.rm = TRUE), sd = sd(absences, na.rm = TRUE))
# plot the two variables side by side
boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)

First, we need to analyze the only numeric variable, the number of absences. As seen in the boxplot, the actual distribution is within normal limits (with no negative values) and has a few outliers. Before excluding those observations, we will explore the other variables and try to fit the model.
Now we will plot the interval variables with a scale of answers from 1 to 5.

boxplot(traveltime, goout, famrel, main = "Boxplots of interval variables", at = c(1,2,3), names = c("travel time","time with friends", "family relationships"), las = 2, col = c("blue","violet","green"))

The “goout” variable seems to be distributed normally, while the other two are skewed in different directions. Next, we will plot each variable against the response variable and explore the dataset further.

library(GGally)
cross <- ggpairs(alc, mapping = aes(col = sex), lower = list(combo = wrap("facethist", bins = 20)))
cross

The variables have very low correlations between themselves, so there is little risk of collinearity tampering with the model. Now onto the model fitting.

m <- glm(high_use ~ absences + traveltime + goout + famrel, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + traveltime + goout + famrel, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0083  -0.7805  -0.5437   0.8464   2.3984  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.97770    0.68176  -4.368 1.26e-05 ***
## absences     0.07559    0.02202   3.432 0.000598 ***
## traveltime   0.43957    0.17532   2.507 0.012167 *  
## goout        0.76735    0.12087   6.348 2.18e-10 ***
## famrel      -0.36294    0.13756  -2.638 0.008331 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 389.46  on 377  degrees of freedom
## AIC: 399.46
## 
## Number of Fisher Scoring iterations: 4

All of the explanatory variables have a statistically significant relationship with the target variable.
Next, we will calculate the odds ratios and confidence intervals to validate our model.

odds_ratios <- coef(m) %>% exp
conf_int <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(odds_ratios, conf_int)
##             odds_ratios      2.5 %    97.5 %
## (Intercept)   0.0509099 0.01285711 0.1879144
## absences      1.0785172 1.03437018 1.1291483
## traveltime    1.5520466 1.10198801 2.1968917
## goout         2.1540600 1.71012923 2.7497327
## famrel        0.6956280 0.52984131 0.9103098

The 95% confidence intervals for the odds ratios show that the odds of high alcohol consumption:
1. increase by a factor of 1.03 to 1.13 with each additional absence;
2. increase by a factor of 1.10 to 2.20 with each additional unit of travel time (contrary to the hypothesis, longer travel time is associated with higher consumption);
3. increase by a factor of 1.71 to 2.75 with each additional unit of going out with friends;
4. are multiplied by 0.53 to 0.91 with each additional unit of family-relationship quality, i.e. good family relations lower the odds.
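As a sanity check on this interpretation, exponentiating a logistic-regression coefficient recovers the odds ratio per one-unit increase of the predictor. A small self-contained simulation (toy data, not the alc dataset):

```r
set.seed(1)
x <- rnorm(500)
# simulate a binary outcome with a true log-odds slope of 0.8
y <- rbinom(500, 1, plogis(-1 + 0.8 * x))

fit <- glm(y ~ x, family = "binomial")
exp(coef(fit)[["x"]])   # estimates the true odds ratio exp(0.8), about 2.2
```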

To continue fitting the model, we will compare predicted and actual values.

# predict the probability of high_use
probabilities <- predict(m, type = "response")

alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
select(alc, absences, traveltime, goout, famrel, high_use, probability, prediction) %>% head(10)
##    absences traveltime goout famrel high_use probability prediction
## 1         5          2     4      4    FALSE  0.47428353      FALSE
## 2         3          1     3      5    FALSE  0.13895457      FALSE
## 3         8          1     2      4     TRUE  0.13581670      FALSE
## 4         1          1     2      3    FALSE  0.11746601      FALSE
## 5         2          1     2      4    FALSE  0.09079210      FALSE
## 6         8          1     2      5    FALSE  0.09855191      FALSE
## 7         0          1     4      4    FALSE  0.28486278      FALSE
## 8         4          2     4      4    FALSE  0.45548223      FALSE
## 9         0          1     2      4    FALSE  0.07906088      FALSE
## 10        0          1     1      5    FALSE  0.02697575      FALSE
prediction_plot <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
prediction_plot + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64921466 0.05235602 0.70157068
##    TRUE  0.18586387 0.11256545 0.29842932
##    Sum   0.83507853 0.16492147 1.00000000

Not totally amazing, but still. The penultimate step would be to calculate the accuracy of the model.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the training data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199

Quite a good result!
Finally, we perform the 10-fold cross-validation.

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2329843

Sliiiiightly better than 0.26 by the DataCamp model.


Chapter 4. Clustering and classification

Data wrangling

The first thing on today’s agenda is to load the Boston dataset from the MASS package in R.

# loading
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

# initial dataset exploration
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

This dataset contains 506 observations of 14 variables, two of them integer-valued (‘chas’, a dummy, and ‘rad’, an index) and the rest numerical. The indices were measured across the following variables:
1. ‘crim’ (per capita crime rate by town)
2. ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft)
3. ‘indus’ (proportion of non-retail business acres per town)
4. ‘chas’ (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
5. ‘nox’ (nitrogen oxides concentration (parts per 10 million))
6. ‘rm’ (average number of rooms per dwelling)
7. ‘age’ (proportion of owner-occupied units built prior to 1940)
8. ‘dis’ (weighted mean of distances to five Boston employment centres)
9. ‘rad’ (index of accessibility to radial highways)
10. ‘tax’ (full-value property-tax rate per $10,000)
11. ‘ptratio’ (pupil-teacher ratio by town)
12. ‘black’ (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)
13. ‘lstat’ (lower status of the population (percent))
14. ‘medv’ (median value of owner-occupied homes in $1000s)

The variables have very different ranges, which suggests standardising them before the analysis. Now onto a graphical overview.

pairs(Boston)

Many variables seem to be strongly correlated. The plot of ‘nox’ against ‘dis’, for instance, suggests a 1/x relationship. This makes sense, since air pollution dissipates the farther you are from the city centre.

To check whether this is just an observation or a valid hypothesis, let’s check for correlations.

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# round to 2 digits
cor_matrix_r <- round(cor_matrix, 2)
cor_matrix_r
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The hypothesis is correct, the correlation coefficient is -0.77.

Data preparation

For classification and clustering purposes, next step would be to scale the dataset to avoid skewing results.

# scaling around 0
boston_scaled <- scale(Boston)

# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Now the variables are on the same scale with a mean of 0. To work with classification, we need a binary or multiclass target variable. In this instance, we are most interested in classifying neighbourhoods according to crime rate, which is why we next transform the ‘crim’ variable into a categorical one via quantiles.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

We are done with data preparation and can now move on to separating the test and training sets for further model training and validation.

# determine the number of rows in the dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create a train set
train <- boston_scaled[ind,]

# create a test set by subtracting the train set
test <- boston_scaled[-ind,]

All the steps in preparing the data are complete.

Linear Discriminant Analysis

Fitting the Linear Discriminant Analysis to the train data:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2500000 0.2500000 0.2475248 
## 
## Group means:
##                  zn      indus          chas        nox         rm
## low       1.0283825 -0.9254675 -0.1179329800 -0.8870913  0.5154579
## med_low  -0.0890347 -0.2637405  0.0005392655 -0.5737786 -0.1221328
## med_high -0.3836558  0.1670026  0.0785016464  0.2956172  0.1826823
## high     -0.4872402  1.0171519 -0.0361030535  1.0417446 -0.4651386
##                 age        dis        rad        tax     ptratio
## low      -0.8712921  0.9065474 -0.6868728 -0.7589060 -0.48212247
## med_low  -0.3704369  0.3522783 -0.5543231 -0.4677380 -0.03479791
## med_high  0.4001481 -0.3404595 -0.4030893 -0.3139393 -0.22550538
## high      0.7915808 -0.8456922  1.6377820  1.5138081  0.78037363
##               black       lstat        medv
## low       0.3783065 -0.78310219  0.59028998
## med_low   0.3095684 -0.16135853  0.02323866
## med_high  0.1198860 -0.03287276  0.21830651
## high     -0.7105463  0.89393387 -0.70226522
## 
## Coefficients of linear discriminants:
##                  LD1           LD2         LD3
## zn       0.088637397  0.7705277789 -0.87559252
## indus    0.003806011 -0.2953748054  0.19938155
## chas    -0.052995147 -0.0379711812  0.12389336
## nox      0.434738952 -0.6128463135 -1.34931264
## rm      -0.110224811 -0.1074986266 -0.21470470
## age      0.228428598 -0.2788587976 -0.31931173
## dis     -0.049867309 -0.2294623006  0.06557595
## rad      3.192061371  0.9594481947 -0.16939902
## tax     -0.032933776 -0.0601589090  0.73622549
## ptratio  0.125103415  0.0154585653 -0.21589998
## black   -0.100724641  0.0001947518  0.10132264
## lstat    0.240022264 -0.2013371225  0.30842566
## medv     0.170982945 -0.3182322020 -0.19622623
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9500 0.0365 0.0135

The group means show that the three variables most strongly associated with the high-crime class are the index of accessibility to radial highways (1.64), the full-value property-tax rate per $10,000 (1.51), and, interestingly, nitrogen oxide concentration (1.04).

The linear discriminant coefficients confirm this for radial-highway accessibility: its LD1 coefficient (3.19) is several times larger in magnitude than that of any other variable in the dataset.

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)

The biplot confirms that accessibility to radial highways has by far the strongest influence on the first discriminant.

Prediction

Next, we test the fitted model on the test set that was separated from the data earlier.

# save the correct classes from test data
correct_classes <- test$crime

# remove this variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      12        1    0
##   med_low    4      16        5    0
##   med_high   0       4       20    1
##   high       0       0        0   27

The model looks quite adequate: 75 of the 102 test observations land on the diagonal, and the high class is predicted perfectly. Most mismatches occur between the neighbouring low and med_low classes.
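The misclassification rate can be quantified directly from the cross-tabulation. A minimal sketch using the counts printed above (they are hard-coded here, so the numbers will differ for another random split):

```r
# confusion matrix from the cross-tabulation above (rows = correct, cols = predicted)
tab <- matrix(c(12, 12,  1,  0,
                 4, 16,  5,  0,
                 0,  4, 20,  1,
                 0,  0,  0, 27),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correctly classified observations / all observations
accuracy <- sum(diag(tab)) / sum(tab)
accuracy
## [1] 0.7352941
```

So roughly 74% of the test observations (75 of 102) are classified correctly.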

K-means clustering

After assigning observations to predetermined classes, we now turn to unsupervised methods. K-means clustering begins, as usual, with loading and scaling the data.

# loading
library(MASS)
data("Boston")

# center and scale the variables (mean 0, sd 1)
boston_scaled <- scale(Boston)

# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

After standardisation, we compute the pairwise distances between observations in the scaled data.

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Next, we try out an arbitrarily chosen number of clusters.

# k-means clustering
km <- kmeans(boston_scaled, centers = 10)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

The plot looks very colourful, but the number of clusters is clearly too large. Let us cut it in half.

# k-means clustering
km <- kmeans(boston_scaled, centers = 5)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Better, but still confusing. It is time for a more principled method of choosing the number of clusters for this dataset.

set.seed(123)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

According to the total within-cluster sum of squares plot, the most pronounced “elbow” occurs at two clusters. Therefore, we rerun the k-means analysis with 2 centroids.
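The visual elbow can also be checked numerically: if the elbow really is at two clusters, the relative drop in total WCSS should be largest for the second cluster. A self-contained sketch (recomputing the twcss curve from above):

```r
library(MASS)

# recompute the total WCSS curve (self-contained version of the chunk above)
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k)$tot.withinss)

# relative reduction in total WCSS contributed by each additional cluster;
# the first entry is the gain from going from 1 to 2 clusters
drops <- -diff(twcss) / twcss[-10]
round(drops, 2)
```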

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

It turns out that the best-fitting number of clusters for this dataset is just 2, which is a bit underwhelming.
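Even with only two clusters, the result can be made more informative by inspecting what separates them. A sketch (rerunning the clustering so the snippet is self-contained) that compares per-cluster variable means:

```r
library(MASS)

# rerun the two-cluster solution
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# mean of each scaled variable within each cluster:
# large gaps between the two rows show which variables drive the split
aggregate(boston_scaled, by = list(cluster = km$cluster), FUN = mean)
```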


Chapter 5. Dimensionality reduction techniques

Data exploration

human <- read.csv("./data/human.csv", row.names = 1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ sed_ratio  : num  0.993 1.003 1.017 1.012 1.032 ...
##  $ labor_ratio: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ ed_exp     : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifeexp    : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni        : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ matmort    : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ teenbirths : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parl       : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##    sed_ratio       labor_ratio         ed_exp         lifeexp     
##  Min.   :0.6681   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:1.0032   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :1.0667   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :1.3523   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:1.3766   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :5.8235   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni            matmort         teenbirths          parl      
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(GGally)
pairs <- ggpairs(human)
pairs

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ MASS::select()  masks dplyr::select()
library(corrplot)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

All of the variables differ dramatically in their ranges. Two of them, GNI per capita and the maternal mortality ratio, are heavily right-skewed; the others appear more or less normally distributed.

This dataframe contains 155 observations of 8 variables, all numeric (gni and matmort are stored as integers). The correlation plot shows that the variables are highly intercorrelated, which makes the dataset well suited for Principal Component Analysis.

Some of the variables in the data are strongly positively or negatively correlated: for instance, maternal mortality quite logically has a strong positive correlation with the adolescent birth rate. On the other hand, maternal mortality is strongly negatively correlated with expected years of education.
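The strongest pairs can also be ranked programmatically. A small helper sketch (top_cors is my own hypothetical function, not from any package) that orders the pairwise correlations by absolute value:

```r
# rank pairwise correlations in a data frame by absolute value
top_cors <- function(df, n = 5) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once
  ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
  data.frame(var1 = rownames(cm)[row(cm)[ord]],
             var2 = colnames(cm)[col(cm)[ord]],
             r    = cm[ord])[seq_len(min(n, length(ord))), ]
}
```

Calling `top_cors(human)` should place pairs such as matmort and teenbirths near the top of the list.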

PCA on non-standardized data

First, let’s try Principal Component Analysis on non-scaled data.

pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855223e+02 2.518704e+01 1.145442e+01 3.766254e+00
## [6] 1.569207e+00 5.345221e-01 1.720831e-01
## 
## Rotation (n x k) = (8 x 8):
##                       PC1           PC2           PC3           PC4
## sed_ratio    1.438861e-05 -0.0022503438  0.0016047293 -1.006305e-03
## labor_ratio  2.331945e-07 -0.0002819336 -0.0005303524  4.692746e-03
## ed_exp      -9.562910e-05  0.0075529685 -0.0142769848  3.313605e-02
## lifeexp     -2.815823e-04  0.0283149570 -0.0129494322  6.752623e-02
## gni         -9.999832e-01 -0.0057723207  0.0005156755 -4.932869e-05
## matmort      5.655734e-03 -0.9916297159 -0.1260337001  6.102337e-03
## teenbirths   1.233961e-03 -0.1255500502  0.9918095255 -5.299293e-03
## parl        -5.526460e-05  0.0032317269  0.0073979127  9.971228e-01
##                       PC5           PC6           PC7           PC8
## sed_ratio    0.0034964012 -7.260101e-02 -9.945465e-01 -7.473545e-02
## labor_ratio  0.0022140041  3.308785e-02  7.249269e-02 -9.968063e-01
## ed_exp       0.1430816370  9.858861e-01 -7.363222e-02  2.784972e-02
## lifeexp      0.9865668049 -1.447742e-01  1.397970e-02 -1.280901e-03
## gni         -0.0001135872 -2.707883e-05  1.066685e-06 -1.814719e-07
## matmort      0.0266316436  1.829585e-03  1.946440e-03  6.376929e-04
## teenbirths   0.0188547462  1.284169e-02  1.016751e-03  2.495532e-05
## parl        -0.0716358797 -2.313031e-02  1.488159e-04  3.773314e-03
# rounded percentages of variance captured by each PC
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2,], digits = 1)

# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "yellow"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA_non-scaled")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

The summary of the model and the plot immediately show that the first component captures essentially 100% of the variance. This is an artefact of the very different ranges of the variables: GNI per capita, which has by far the largest standard deviation, is represented by the longest axis and dominates the first component, and all the arrows collapse onto the same axis as if the variables were perfectly correlated.

This analysis is therefore not meaningful.
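The scale effect is easy to reproduce with simulated data (a toy sketch, not the human data itself): when one column's variance dwarfs the others, unscaled PCA assigns nearly all of the variance to that column alone.

```r
set.seed(42)
toy <- data.frame(small = rnorm(100),             # variance ~ 1
                  huge  = rnorm(100, sd = 1000))  # variance ~ 1e6, like gni

# unscaled: PC1 is essentially the 'huge' column alone
p_raw <- prcomp(toy)
summary(p_raw)$importance[2, "PC1"]   # proportion of variance on PC1, ~ 1

# scaled: the variance is shared between components
p_std <- prcomp(toy, scale. = TRUE)
summary(p_std)$importance[2, "PC1"]   # ~ 0.5
```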

PCA on standardized data

# standardize the variables
human_std <- scale(human)
summary(human_std)
##    sed_ratio         labor_ratio          ed_exp           lifeexp       
##  Min.   :-0.92829   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.47369   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median :-0.38753   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.03306   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 6.06675   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       gni             matmort          teenbirths           parl        
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0686801 1.1395786 0.8731156 0.8198145 0.6157250 0.5399602 0.4589378
## [8] 0.3258813
## 
## Rotation (n x k) = (8 x 8):
##                     PC1         PC2         PC3         PC4        PC5
## sed_ratio    0.35428486 -0.03349747  0.26152597 -0.62161810  0.6060082
## labor_ratio  0.05429201  0.72405070 -0.60003871  0.01397768  0.2272492
## ed_exp      -0.42790065  0.14040242 -0.07383829 -0.09101094  0.1679844
## lifeexp     -0.44475461 -0.02380824  0.10661368 -0.05928025  0.1612152
## gni         -0.34485721  0.04901234 -0.12633930 -0.73711731 -0.5106521
## matmort      0.43894896  0.14353662 -0.10075255 -0.22918529 -0.1688594
## teenbirths   0.41461440  0.07478342  0.04257017  0.01639864 -0.4720253
## parl        -0.08438722  0.65249620  0.72581969  0.07390325 -0.1217201
##                     PC6          PC7         PC8
## sed_ratio   -0.15511613 -0.063047732 -0.15255603
## labor_ratio -0.04403599 -0.230781162 -0.07562945
## ed_exp      -0.37880046  0.780514915 -0.05040850
## lifeexp     -0.41364089 -0.430761634  0.63568952
## gni          0.14453049 -0.125313539 -0.14835638
## matmort      0.20112665  0.359359700  0.72521669
## teenbirths  -0.76068241 -0.055924672 -0.12588884
## parl         0.13940058  0.005956103 -0.02382852
# rounded percentages of variance captured by each PC
s_st <- summary(pca_human_std)
pca_pr_st <- round(100*s_st$importance[2,], digits = 1)

# create object pc_lab to be used as axis labels
pc_lab_st <- paste0(names(pca_pr_st), " (", pca_pr_st, "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "green"), xlab = pc_lab_st[1], ylab = pc_lab_st[2], main = "PCA_scaled")

This analysis looks much more reliable. The countries are now distributed throughout the two-dimensional space, and the first two principal components together capture 69.7% of the variance (53.5% and 16.2%, respectively), a reasonable share for the first plane of a PCA.

Speculating about what to call the two dimensions, I would suggest “welfare” for PC1, because it collects indicators of health, social protection, and economic growth, and “participation” for PC2, since it deals with workforce and political participation. The variables contributing most to the former are life expectancy at birth, GNI per capita, the maternal mortality ratio, the adolescent birth rate, expected years of education, and the secondary education ratio. For the latter, the two main contributors are representation in parliament and the labour force participation ratio. Judging by the angles between the arrows along the two axes, the variables within each group are closely correlated with each other, and the arrow lengths became much more even after standardisation.

Multiple correspondence analysis

library(FactoMineR)
data(tea)

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
library(ggplot2)
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The “tea” dataset contains 300 observations and 36 variables pertaining to people’s tea-drinking habits; all but one (the integer age) are categorical factors. Some of the binary variables are fairly balanced between their two levels, others are heavily skewed toward one.

The full dataframe is too large for a meaningful MCA to my taste, so I chose nine variables of particular interest.

# creating a subset
keep_columns <- c("Tea", "How", "how", "sugar", "home", "breakfast", "age", "sex", "frequency")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##                                                                     
##        home             breakfast        age        sex    
##  home    :291   breakfast    :144   Min.   :15.00   F:178  
##  Not.home:  9   Not.breakfast:156   1st Qu.:23.00   M:122  
##                                     Median :32.00          
##                                     Mean   :37.05          
##                                     3rd Qu.:48.00          
##                                     Max.   :90.00          
##        frequency  
##  1/day      : 95  
##  1 to 2/week: 44  
##  +2/day     :127  
##  3 to 6/week: 34  
##                   
## 
str(tea_time)
## 'data.frame':    300 obs. of  9 variables:
##  $ Tea      : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How      : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how      : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar    : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ home     : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ age      : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex      : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ frequency: Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
library(ggplot2)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

A lot of young people and Earl Grey lovers who drink more than two cups a day (just like me).
Prior to performing MCA, we need to transform the variables into factors, because MCA only operates on categorical variables.
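As a compact alternative to converting each column separately, all columns can be factored in one step (a sketch; note that factoring the integer age gives one level per distinct age, which is why the MCA below reports so many dimensions — binning age first with cut() would be a common alternative):

```r
# assumes tea_time from above is in memory;
# convert every column to a factor in one step
tea_time[] <- lapply(tea_time, factor)

# confirm that all columns are now factors
sapply(tea_time, is.factor)
```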

# prepare the data for MCA by converting all the variables to factors
colnames(tea_time)
## [1] "Tea"       "How"       "how"       "sugar"     "home"      "breakfast"
## [7] "age"       "sex"       "frequency"
tea_time$Tea <- factor(tea_time$Tea)
tea_time$How <- factor(tea_time$How)
tea_time$how <- factor(tea_time$how)
tea_time$sugar <- factor(tea_time$sugar)
tea_time$home <- factor(tea_time$home)
tea_time$breakfast <- factor(tea_time$breakfast)
tea_time$age <- factor(tea_time$age)
tea_time$sex <- factor(tea_time$sex)
tea_time$frequency <- factor(tea_time$frequency)

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.234   0.205   0.191   0.190   0.178   0.170
## % of var.              2.843   2.491   2.324   2.309   2.164   2.067
## Cumulative % of var.   2.843   5.334   7.658   9.968  12.132  14.199
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.162   0.157   0.153   0.150   0.147   0.139
## % of var.              1.973   1.914   1.866   1.825   1.793   1.687
## Cumulative % of var.  16.172  18.086  19.952  21.776  23.569  25.257
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.130   0.125   0.111   0.111   0.111   0.111
## % of var.              1.575   1.522   1.351   1.351   1.351   1.351
## Cumulative % of var.  26.832  28.354  29.705  31.056  32.408  33.759
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  35.110  36.462  37.813  39.164  40.516  41.867
##                       Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  43.218  44.570  45.921  47.272  48.624  49.975
##                       Dim.31  Dim.32  Dim.33  Dim.34  Dim.35  Dim.36
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  51.327  52.678  54.029  55.381  56.732  58.083
##                       Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  59.435  60.786  62.137  63.489  64.840  66.191
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  67.543  68.894  70.245  71.597  72.948  74.300
##                       Dim.49  Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  75.651  77.002  78.354  79.705  81.056  82.408
##                       Dim.55  Dim.56  Dim.57  Dim.58  Dim.59  Dim.60
## Variance               0.111   0.111   0.111   0.111   0.111   0.111
## % of var.              1.351   1.351   1.351   1.351   1.351   1.351
## Cumulative % of var.  83.759  85.110  86.462  87.813  89.164  90.516
##                       Dim.61  Dim.62  Dim.63  Dim.64  Dim.65  Dim.66
## Variance               0.082   0.076   0.072   0.069   0.066   0.061
## % of var.              0.997   0.929   0.877   0.842   0.802   0.738
## Cumulative % of var.  91.513  92.442  93.319  94.161  94.963  95.700
##                       Dim.67  Dim.68  Dim.69  Dim.70  Dim.71  Dim.72
## Variance               0.058   0.055   0.051   0.046   0.043   0.042
## % of var.              0.711   0.665   0.619   0.554   0.521   0.507
## Cumulative % of var.  96.411  97.076  97.696  98.249  98.770  99.277
##                       Dim.73  Dim.74
## Variance               0.032   0.027
## % of var.              0.391   0.332
## Cumulative % of var.  99.668 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.026  0.001  0.000 | -0.762  0.946  0.033 |  0.936
## 2                  |  0.573  0.469  0.076 | -0.306  0.152  0.022 |  0.581
## 3                  |  0.357  0.182  0.021 |  0.330  0.178  0.018 | -0.328
## 4                  | -0.480  0.328  0.066 | -0.396  0.256  0.045 | -0.068
## 5                  | -0.013  0.000  0.000 |  0.083  0.011  0.001 |  0.398
## 6                  | -0.233  0.077  0.023 | -0.178  0.052  0.013 | -0.071
## 7                  | -0.035  0.002  0.000 |  0.257  0.107  0.012 | -0.021
## 8                  |  0.007  0.000  0.000 |  0.446  0.324  0.031 |  0.183
## 9                  |  0.412  0.243  0.029 | -0.329  0.176  0.018 |  0.270
## 10                 |  0.550  0.432  0.058 |  0.313  0.159  0.019 |  0.239
##                       ctr   cos2  
## 1                   1.529  0.050 |
## 2                   0.589  0.078 |
## 3                   0.188  0.018 |
## 4                   0.008  0.001 |
## 5                   0.276  0.026 |
## 6                   0.009  0.002 |
## 7                   0.001  0.000 |
## 8                   0.058  0.005 |
## 9                   0.127  0.012 |
## 10                  0.100  0.011 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.750   6.586   0.184   7.416 |   0.649   5.630
## Earl Grey          |  -0.217   1.437   0.085  -5.034 |  -0.323   3.632
## green              |  -0.413   0.891   0.021  -2.510 |   0.432   1.115
## alone              |  -0.154   0.731   0.044  -3.625 |   0.231   1.886
## lemon              |  -0.264   0.363   0.009  -1.602 |  -0.186   0.207
## milk               |   0.306   0.932   0.025   2.725 |  -0.838   8.008
## other              |   2.160   6.653   0.144   6.569 |   1.541   3.863
## tea bag            |  -0.158   0.675   0.033  -3.129 |  -0.221   1.499
## tea bag+unpackaged |   0.357   1.893   0.058   4.164 |  -0.041   0.029
## unpackaged         |  -0.184   0.192   0.005  -1.172 |   1.151   8.626
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.138   6.418 |   0.671   6.463   0.148   6.642 |
## Earl Grey            0.188  -7.492 |  -0.431   6.944   0.335 -10.006 |
## green                0.023   2.628 |   1.014   6.582   0.127   6.167 |
## alone                0.099   5.449 |  -0.193   1.413   0.069  -4.556 |
## lemon                0.004  -1.131 |   0.192   0.237   0.005   1.169 |
## milk                 0.187  -7.474 |   0.559   3.817   0.083   4.985 |
## other                0.073   4.685 |  -0.430   0.322   0.006  -1.307 |
## tea bag              0.064  -4.367 |   0.030   0.030   0.001   0.595 |
## tea bag+unpackaged   0.001  -0.484 |  -0.377   2.588   0.065  -4.403 |
## unpackaged           0.181   7.350 |   0.842   4.948   0.097   5.377 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.188 0.191 0.344 |
## How                | 0.183 0.257 0.100 |
## how                | 0.058 0.187 0.130 |
## sugar              | 0.199 0.068 0.003 |
## home               | 0.172 0.010 0.007 |
## breakfast          | 0.271 0.174 0.007 |
## age                | 0.557 0.629 0.695 |
## sex                | 0.067 0.009 0.353 |
## frequency          | 0.409 0.317 0.081 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
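Rather than scrolling through the full printed summary above, the eigenvalues can be inspected directly. This is a minimal sketch, assuming `mca` is the result object from `FactoMineR::MCA()`, whose `$eig` component is a matrix with one row per dimension:

```r
# First few eigenvalues with % of variance and cumulative %
head(mca$eig)

# Quick scree plot: percentage of variance per dimension
barplot(mca$eig[, "percentage of variance"],
        names.arg = rownames(mca$eig),
        las = 2, cex.names = 0.5,
        main = "MCA: % of variance by dimension")
```

A scree plot like this makes the long flat stretch of near-identical 1.35% dimensions immediately visible.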

The MCA resulted in quite a busy graph with not so great variance coverage: just 2.84% for the first dimension and 2.49% for the second. Converting the whole age variable into a factor was not a brilliant idea either; I should have split it into more meaningful bins according to age groups.
However, we can see some clusters: people who prefer Earl Grey tend to take lemon and sugar with their tea and drink it once a day. Black-tea lovers drink it unsweetened, twice a day or more. Finally, infrequent tea drinkers (one to six times a week) lean towards drinking it not at breakfast, not at home, and without any of the fancy extras.
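The age binning mentioned above could be done with `cut()`. A hypothetical sketch, assuming the data frame is called `tea` and has a numeric `age` column (both names are assumptions, not from the output above):

```r
# Bin numeric age into groups instead of treating every distinct
# age value as its own factor level (break points chosen arbitrarily)
tea$age_group <- cut(tea$age,
                     breaks = c(15, 24, 34, 44, 54, 64, 100),
                     labels = c("15-24", "25-34", "35-44",
                                "45-54", "55-64", "65+"))
table(tea$age_group)
```

With a handful of age groups instead of dozens of levels, the `age` categories would contribute far fewer near-empty columns to the MCA, and the plot would be much less cluttered.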